Digitization and Search, A Non-Traditional Use of HPC
Abstract
We describe our efforts to develop an open source cyberinfrastructure that provides a form of automated search of handwritten content within large digitized document archives. Such collections are a treasure trove of data spanning from decades ago to the present. The information they contain is relevant both to researchers, who might extract numerical or statistical data from such sources, and to the general public. With the push to digitize our paper archives we are, however, faced with the fact that although these digital versions are easier to share, they are not trivially searchable, because the digitization process produces image data and not text. This inability to find or identify contents within these collections makes the data largely unusable without a lengthy and costly manual transcription process. To carry out the search we build on a computer vision technique called word spotting. A form of content-based image retrieval, it avoids the still difficult task of directly recognizing the text: a user searches with a query image containing handwritten text, and the database of images is ranked by how similar their content looks to the query. To make this search capability available on a large archive, three computationally expensive pre-processing steps are required (Figure 1). First, forms are segmented into individual units of handwritten information. In the case of the 1930 Census data collection, which contains approximately 3.6 million spreadsheet-like forms, this entails breaking the form images into sub-images of the individual cells that contain the information about the individuals recorded in the Census. Second, the extracted sub-images are processed to extract features and descriptors that represent the handwritten contents within them.
The word spotting method we use yields a 30-dimensional signature vector derived from the frequency components of the darker ink pixels [1]. The distance between two such signature vectors indicates how similar the handwritten contents of their cell sub-images are. Third, an indexing step organizes the extracted signatures into a binary tree structure to enable fast user queries. For the 1930 Census data this involves organizing nearly 7 billion sub-images using hierarchical agglomerative clustering. Organizing the entire collection at once isn't practical, so we instead break this step into multiple index construction steps based on states, categories, and microfilm reels passing the …
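The ranking and indexing ideas described above can be illustrated with a minimal Python sketch. It assumes the 30-dimensional signatures have already been extracted and does not reproduce the feature extraction of [1]. The Euclidean metric, the centroid-based merge rule, and all function names (`distance`, `rank`, `build_tree`, `search`) are illustrative assumptions, not details taken from the system described here.

```python
import math


def distance(a, b):
    """Euclidean distance between two signature vectors (metric is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, signatures):
    """Brute-force ranking: indices ordered from most to least similar to the query."""
    return sorted(range(len(signatures)),
                  key=lambda i: distance(query, signatures[i]))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_tree(signatures):
    """Naive agglomerative clustering that yields a binary tree.

    Repeatedly merges the two clusters whose centroids are closest
    (centroid linkage is an assumption). Cubic cost: illustration only.
    """
    nodes = [{"center": list(s), "members": [i], "left": None, "right": None}
             for i, s in enumerate(signatures)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = distance(nodes[i]["center"], nodes[j]["center"])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        left, right = nodes[i], nodes[j]
        members = left["members"] + right["members"]
        merged = {"center": centroid([signatures[k] for k in members]),
                  "members": members, "left": left, "right": right}
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


def search(tree, query):
    """Approximate lookup: descend toward the closer child until a leaf is reached."""
    node = tree
    while node["left"] is not None:
        d_left = distance(query, node["left"]["center"])
        d_right = distance(query, node["right"]["center"])
        node = node["left"] if d_left <= d_right else node["right"]
    return node["members"][0]


# Toy data: four hypothetical 30-dimensional signatures in two clear groups.
signatures = [[0.0] * 30, [0.1] * 30, [5.0] * 30, [5.2] * 30]
query = [0.02] * 30
tree = build_tree(signatures)
print(rank(query, signatures))
print(search(tree, query))
```

The brute-force `rank` compares the query against every signature, whereas `search` follows a single root-to-leaf path. This is the kind of saving a tree index is meant to provide, at the cost of returning an approximate rather than exact nearest match.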
Publication date: 2012